Exploring Genomic Data

This notebook demonstrates working with genetic variant data stored as publicly accessible Google BigQuery datasets.

Specifically, we will work with the Illumina Platinum Genomes data. The source data was originally in VCF format, which was imported to Google Genomics and exported to BigQuery.

If you want to explore other genomics samples, see https://github.com/googlegenomics/datalab-examples. You can import them into your Google Cloud Datalab instance by uploading them from the notebook list page.


In [1]:
import google.datalab.bigquery as bq

Variants

Let's start with a little bit of genomics nomenclature:

  • Variant: A region of the genome that has been identified as differing from the reference genome.
  • Non-variant Segment: A region of the genome that matches the reference genome.
  • Reference Name: The name of a reference segment of DNA. Typically, this is a chromosome, but it may be other named regions from a reference genome.
  • Reference Bases: The A's, C's, T's, and G's comprising the DNA of the reference genome at the site of variation.
  • Alternate Bases: The A's, C's, T's, and G's comprising the DNA actually found at the site of variation for the sample(s) sequenced.
  • SNP: A single nucleotide polymorphism is a DNA sequence variation in which a single nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes.

In [2]:
variants = bq.Table('genomics-public-data.platinum_genomes.variants')
variants.length


Out[2]:
261285806

Wow, that's a lot of records! For a detailed description of the schema, see Understanding the BigQuery Variants Table Schema.

SNPs

Next, let's take a look at a few SNPs in our data. As mentioned in the nomenclature, SNPs are a particular kind of genetic variant. Let's create an SQL statement that can list all variant records for a single base change:


In [3]:
%%bq query --name single_base
SELECT
  reference_name,
  start,
  reference_bases,
  alternate_bases
FROM
  `genomics-public-data.platinum_genomes.variants`
WHERE
  reference_bases IN ('A','C','G','T') AND
  ARRAY_LENGTH(alternate_bases) = 1
LIMIT 100

In [4]:
%bq execute --query single_base


Out[4]:
reference_namestartreference_basesalternate_bases
chr1355705238A['C']
chr687341707A['C']
chr9135065442A['C']
chr3172106165A['G']
chr1236345163A['G']
chr1497175506A['G']
chr2160217510A['G']
chr5133727302A['G']
chr2130951580A['G']
chr693866087A['G']
chr855417271A['G']
chr523130840A['G']
chr1195554304A['G']
chr116141497A['G']
chr1238015193A['G']
chr67423271A['G']
chr4146456188A['G']
chr535870587A['G']
chr52578034A['G']
chr349026325A['T']
chr1831403838A['T']
chr732113497A['T']
chr723822179A['T']
chr139505431C['A']
chrX3515518C['A']

(rows: 100, time: 0.6s, cached, job: job_6ThkJiu7kjuyGN7tfISCqxRnpas)

Transition/Transversion Ratio

A SNP can be further classified as either a transition or a transversion. The ratio of transitions to transversions (TiTv ratio) in humans is observed to be approximately 2.1, but this is not uniform across the genome. Let's take a closer look by computing the TiTv ratio in contiguous regions of 100,000 base pairs.


In [5]:
%%bq query --name snps
SELECT
  reference_name,
  reference_bases,
  alternate_bases,
  CAST(FLOOR(start / 100000) AS INT64) AS windows,
  CONCAT(reference_bases, CONCAT(CAST('->' AS STRING), ARRAY_TO_STRING(alternate_bases, ''))) AS mutation,
  ARRAY_LENGTH(alternate_bases) AS num_alts
FROM
  `genomics-public-data.platinum_genomes.variants`
WHERE reference_name IN ("1", "chr1")
  # Limit to bi-allelic SNP variants
  AND reference_bases IN ('A','C','G','T')
  AND ARRAY_LENGTH(alternate_bases) = 1

We've updated the above query to include the "window" in which the SNP resides, and added a new field called "mutation".


In [6]:
%bq sample -q snps --count 10


Out[6]:
reference_namereference_basesalternate_baseswindowsmutationnum_alts
chr1A['G']1062A->G1
chr1A['G']1203A->G1
chr1C['T']2020C->T1
chr1C['A']2372C->A1
chr1G['A']2244G->A1
chr1G['C']31G->C1
chr1G['T']219G->T1
chr1G['A']1142G->A1
chr1G['A']58G->A1
chr1T['A']2259T->A1

(rows: 10, time: 1.5s, 4GB processed, job: job_F7w9isHGHamer5Ts_Hgb1ykt6R8)

Next we group and classify the SNPs within their windows.


In [7]:
%%bq query --name windows --subqueries snps
SELECT
  reference_name,
  windows AS win,
  COUNTIF(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
  COUNTIF(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                       'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
  COUNT(mutation) AS num_variants_in_window
  FROM snps
GROUP BY
  reference_name,
  win

In [8]:
%bq sample -q windows --count 10


Out[8]:
reference_namewintransitionstransversionsnum_variants_in_window
chr1189310364183
chr169113089235
chr11071139108262
chr11560144108266
chr11132164115294
chr1250137142303
chr12029245164439
chr168194176406
chr12005238180455
chr11615371227609

(rows: 10, time: 4.8s, 4GB processed, job: job_2EkOE9dx4nB82TP5mvzdof3TObA)

And finally, we compute the per-window TiTv ratio.


In [9]:
%%bq query --name titv --subqueries snps windows
SELECT
  reference_name,
  win * 100000 AS window_start,
  transitions,
  transversions,
  transitions/transversions AS titv,
  num_variants_in_window
FROM windows
ORDER BY
    window_start

In [10]:
%bq sample -q titv --count 10


Out[10]:
reference_namewindow_starttransitionstransversionstitvnum_variants_in_window
chr102321561.48717948718397
chr1100000118562.10714285714177
chr120000055371.4864864864994
chr1300000180.1259
chr140000021111.9090909090932
chr1500000160493.26530612245210
chr160000063431.46511627907107
chr17000002782521.10317460317544
chr18000009066191.463651050081562
chr19000003653001.21666666667698

(rows: 10, time: 2.1s, 4GB processed, job: job_k3CAnEQGwFwNyNlIk9DWCsZZvQM)

In [11]:
titvRatios = titv.execute(output_options=bq.QueryOutput.dataframe()).result()
titvRatios[:5]


Out[11]:
reference_name window_start transitions transversions titv num_variants_in_window
0 chr1 0 232 156 1.487179 397
1 chr1 100000 118 56 2.107143 177
2 chr1 200000 55 37 1.486486 94
3 chr1 300000 1 8 0.125000 9
4 chr1 400000 21 11 1.909091 32

Visualization

Now we can take the ratios and plot them by genomic position to see how the ratio varies depending upon where it was computed within the chromosome. By default, this plot shows chromosome 1 with its gap in the center of the data corresponding to its metacentric contromere.


In [14]:
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lmplot(x='window_start',
               y='titv',
               data = titvRatios,
               size = 8,
               aspect = 2,
               scatter_kws = { 'color': 'black', 'alpha': 0.4 },
               line_kws = { 'color': 'blue', 'alpha': 0.5, 'lw': 4 },
               lowess = True)
plt.xlabel('Genomic Position', fontsize = 14)  
plt.ylabel('Ti/Tv Ratio', fontsize = 14)
plt.title('Ti/Tv Ratio computed on %d base pair windows for %s' % (100000, ['1', 'chr1']),
          fontsize = 18)
plt.annotate('Centromere', xy = (1.3e8, 1.5), xytext = (1.5e8, 3), size = 14,
             arrowprops=dict(facecolor='black', shrink=0.05))


Out[14]:
<matplotlib.text.Annotation at 0x7f68e3e61cd0>

In [ ]: